Adapting to the Shifting Intent of Search Queries*



Umar Syed^
Department of Computer and Information Science
University of Pennsylvania
Philadelphia, PA 19104
usyed@cis.upenn.edu

Aleksandrs Slivkins
Microsoft Research
Mountain View, CA 94043
slivkins@microsoft.com

Nina Mishra
Microsoft Research
Mountain View, CA 94043
ninam@microsoft.com


Abstract 



Search engines today present results that are often oblivious to abrupt shifts in
intent. For example, the query 'independence day' usually refers to a US holiday,
but the intent of this query abruptly changed during the release of a major film
by that name. While no studies exactly quantify the magnitude of intent-shifting
traffic, studies suggest that news events, seasonal topics, pop culture, etc. account
for 50% of all search queries. This paper shows that the signals a search engine
receives can be used both to determine that a shift in intent has happened and to
find a result that is now more relevant. We present a meta-algorithm that marries a
classifier with a bandit algorithm to achieve regret that depends logarithmically on
the number of query impressions, under certain assumptions. We provide strong
evidence that this regret is close to the best achievable. Finally, via a series of
experiments, we demonstrate that our algorithm outperforms prior approaches,
particularly as the amount of intent-shifting traffic increases.

1 Introduction


Search engines typically use a ranking function to order results. The function scores a document by 
the extent to which it matches the query, and documents are ordered according to this score. Usually, 
this function is fixed in the sense that it does not change from one query to another and also does not 
change over time. 



Intuitively, a query is "intent-shifting" if the most desired search result(s) change over time. More
concretely, a query's intent has shifted if the click distribution over search results at some time
differs from the click distribution at a later time. For the query 'tomato' on the heels of a tomato
salmonella outbreak, the probability a user clicks on a news story describing the outbreak increases
while the probability a user clicks on the Wikipedia entry for tomatoes rapidly decreases. Studies
suggest that queries likely to be intent-shifting, such as pop culture, news events, trends, and
seasonal topics queries, constitute roughly half of the search queries that a search engine
receives [10].

The goal of this paper is to devise an algorithm that quickly adapts search results to shifts in user 
intent. Ideally, for every query and every point in time, we would like to display the search result that 
users are most likely to click. Since traditional ranking features like PageRank ID change slowly 
over time, and may be misleading if user intent has shifted very recently, we want to use just the 
observed click behavior of users to decide which search results to display. 

There are many signals a search engine can use to detect when the intent of a query shifts. Query 
features such as volume, abandonment rate, reformulation rate, occurrence in news articles, and



*This is the full version of a paper in NIPS 2009. 

^This work was done while the author was an intern at Microsoft Research and a student in the Department 
of Computer Science, Princeton University. 



the age of matching documents can all be used to build a classifier which, given a query, determines 
whether the intent has shifted. We refer to these features as the context, and an occasion when a
shift in intent occurs as an event. 

One major challenge in building an event classifier is obtaining training data. For most query and 
date combinations (e.g. 'tomato, 06/09/2008'), it will be difficult even for a human labeler to recall 
in hindsight whether an event related to the query occurred on that date. In this paper, we propose a 
novel solution that learns from unlabeled contexts and user click activity. 

Contributions. We describe a new algorithm that leverages the information contained in contexts, 
provided that such information is sufficiently rich. Specifically, we assume that there exists a de- 
terministic oracle (unknown to the algorithm) which inputs the context and outputs a correct binary 
prediction of whether an event has occurred in the current round. To simulate such an oracle, we use 
a classification algorithm. However, we do not assume that we have a priori labeled samples to train 
such a classifier. Instead, we generate the labels ourselves.

Our algorithm is in fact a meta-algorithm that combines a bandit algorithm designed for the event- 
free setting with an online classification algorithm. The classifier uses the contexts to predict when 
events occur, and the bandit algorithm "starts over" on positive predictions. The bandit algorithm 
provides feedback to the classifier by checking, soon after each of the classifier's positive predic- 
tions, whether the optimal search result actually changed. In such a setup, one needs to overcome 
several technical hurdles, e.g. ensure that the feedback is not "contaminated" by events in the past 
and in the near future. We design the whole triad — the bandit algorithm, the classifier, and the 
meta-algorithm — so as to obtain strong provable guarantees. Our bandit subroutine — a novel 
version of algorithm UCB1 from [2] which additionally provides high-confidence estimates on the
suboptimality of arms — may be of independent interest. 

For suitable choices of the bandit and classifier subroutines, the regret incurred by our
meta-algorithm is (under certain mild assumptions) at most $O\big((k + d_{\mathcal{F}}) \cdot \frac{n}{\Delta} \log T\big)$, where k is the number
of events, $d_{\mathcal{F}}$ is a certain measure of the complexity of the concept class $\mathcal{F}$ used by the classifier,
n is the number of relevant search results, $\Delta$ is the "minimum suboptimality" of any search result
(defined formally in Section 3), and T is the total number of impressions. This regret bound has a
very weak dependence on T, which is highly desirable for search engines that receive much traffic.

The context turns out to be crucial for achieving logarithmic dependence on T. Indeed, we show that
any bandit algorithm that ignores context suffers regret $\Omega(\sqrt{T})$, even when there is only one event.
Unlike many lower bounds for bandit problems, our lower bound holds even when $\Delta$ is a constant
independent of T. We also show that, assuming a logarithmic dependence on T, the dependence on
k and $d_{\mathcal{F}}$ is essentially optimal.

For empirical evaluation, we ideally need access to the traffic of a real search engine so that search
results can be adapted based on real-time click activity. Since we did not have access to live
traffic, we instead conduct a series of synthetic experiments. The experiments show that if there are
no events then the well-studied UCB1 algorithm [2] performs the best. However, when many
different queries experience events, our algorithm significantly outperforms prior
methods.



2 Related Work 

While there has been a substantial amount of work on ranking algorithms lfm i5l[T3l[8ll6l. all of these 
results assume that there is a fixed ranking function to learn, not one that shifts over time. Online 
bandit algorithms (see Q for background) have been considered in the context of ranking. For 
instance, Radlinski et al. [20] showed how to compose several instantiations of a bandit algorithm
to produce a ranked list of search results. Pandey et al. [19] showed that bandit algorithms can
be effective in serving advertisements to search engine users. These approaches also assume a 
stationary inference problem. 

Although no existing bandit algorithms are specifically designed for our problem setting, there are 
two well-known algorithms that we compare against in this paper. The UCB1 algorithm [2] assumes
fixed click probabilities and has regret at most $O(\frac{n}{\Delta} \log T)$. The EXP3.S algorithm [3] assumes



In practice, the arms can be restricted to, say, the top ten results that match the query. 



that click probabilities can change on every round and has regret at most $O(k \sqrt{nT \log(nT)})$ for
arbitrary $p_t$'s. Note that the dependence of EXP3.S on T is substantially stronger.

The "contextual bandits" problem setting ll22l [TSl [T2l [TTl [141 is similar to ours. A key difference 
is that the context received in each round is assumed to contain information about the identity of 
an optimal result $i_t^*$, a considerably stronger assumption than we make. Our context includes only
side information such as volume of the query, but we never actually receive information about the 
identity of the optimal result. 

A different approach is to build a statistical model of user click behavior. This approach has been
applied to the problem of serving news articles on the web. Diaz O used a regularized logistic 
model to determine when to surface news results for a query. Agarwal et al. [1] used several models,
including a dynamic linear growth curve model. 

There has also been work on detecting bursts in data streams. For example, Kleinberg lITSI describes 
a state-based model for inferring stages of burstiness. The goal of our work is not to detect bursts, 
but rather to predict shifts in intent. 



In a recent concurrent and independent work, Yu et al. [23] studied bandit problems with
"piecewise-stationary" distributions, a notion that closely resembles our definition of events. However, they
make different assumptions than we do about the information a bandit algorithm can observe. Ex- 
pressed in the language of our problem setting, they assume that from time-to-time a bandit algo- 
rithm receives information about how users would have responded to search results that are never 
actually displayed. For our setting, this assumption is clearly inappropriate. 

3 Problem Formulation and Preliminaries 

We view the problem of deciding which search results to display in response to user click behavior 
as a bandit problem, a well-known type of sequential decision problem. For a given query q, the
task is to determine, at each round $t \in \{1, \dots, T\}$ that q is issued by a user to our search engine, a
single result $i_t \in \{1, \dots, n\}$ to display. This result is clicked by the user with probability $p_t(i_t)$.
A bandit algorithm A chooses $i_t$ using only observed information from previous rounds, i.e., all
previously displayed results and received clicks. The performance of an algorithm A is measured
by its regret: $R_A(T) = E\left[\sum_{t=1}^{T} \big(p_t(i_t^*) - p_t(i_t)\big)\right]$, where an optimal result $i_t^* \in \arg\max_i p_t(i)$ is
one with maximum click probability, and the expectation is taken over the randomness in the clicks
and the internal randomization of the algorithm. Note our unusually strong definition of regret: we
are competing against the best result on every round.
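
To make the regret definition concrete, the sketch below evaluates the sum of $p_t(i_t^*) - p_t(i_t)$ for a fixed sequence of displayed results. The function name `dynamic_regret` and the toy click probabilities are assumptions made for illustration; the expectation over an algorithm's internal randomness is not modeled here.

```python
def dynamic_regret(click_probs_per_round, chosen):
    """Sum over rounds of p_t(i_t^*) - p_t(i_t): regret against the per-round optimum."""
    total = 0.0
    for p_t, i_t in zip(click_probs_per_round, chosen):
        total += max(p_t) - p_t[i_t]
    return total

# Hypothetical example: two results and a single event that swaps their click rates.
probs = [[0.75, 0.25]] * 3 + [[0.25, 0.75]] * 3
never_adapts = [0] * 6            # keeps showing result 0 after the event
adapts_at_once = [0, 0, 0, 1, 1, 1]

regret_static = dynamic_regret(probs, never_adapts)
regret_adaptive = dynamic_regret(probs, adapts_at_once)
```

Because the benchmark is the best result in every round, a policy that never adapts keeps paying the gap after the event, while a policy that switches immediately pays nothing.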

We call an event any round t where $p_{t-1} \neq p_t$. It is reasonable to assume that the number of
events $k \ll T$, since we believe that abrupt shifts in user intent are relatively rare. Most existing
bandit algorithms make no attempt to predict when events will occur, and consequently suffer regret
$\Omega(\sqrt{T})$. On the other hand, a typical search engine receives many signals that can be used to predict
events, such as bursts in query reformulation, average age of retrieved documents, etc.

We assume that our bandit algorithm receives a context $x_t \in X$ at each round t, and that there exists
a function $f \in \mathcal{F}$, in some known concept class $\mathcal{F}$, such that $f(x_t) = +1$ if an event occurs at
round t, and $f(x_t) = -1$ otherwise. In other words, f is an event oracle. The tractability of $\mathcal{F}$ will
be characterized by a number $d_{\mathcal{F}}$ called the diameter of $\mathcal{F}$, detailed in Section 5. At each round t,
an eventful bandit algorithm must choose a result $i_t$ using only observed information from previous
rounds, i.e., all previously displayed results and received clicks, plus all contexts up to round t.

In order to develop an efficient eventful bandit algorithm, we make an additional key assumption: at
least one optimal result before an event is significantly suboptimal after the event. More precisely,
we assume there exists a minimum shift $\epsilon_S > 0$ such that, whenever an event occurs at round t,
we have $p_t(i_{t-1}^*) < p_t(i_t^*) - \epsilon_S$ for at least one previously optimal search result $i_{t-1}^*$. For our
problem setting, this assumption is relatively mild: the events we are interested in tend to have a


^For simplicity, we focus on the task of returning a single result, and not a list of results. Techniques
from [20] may be adopted to find a good list of results.

^In some of our analysis, we require contexts be restricted to a strict (concept-specific) subset of X; the
value of f outside this subset will technically be null. See Section 5 for more details.



Table 1: Notation

  Symbol          Meaning
  $X$             universe of contexts
  $x_t$           context in round t
  $k$             number of events
  $T$             time horizon
  $\epsilon_S$    min shift of an event
  $\Delta$        min suboptimality of an arm



rather dramatic effect on the optimal search results. Moreover, our bounds are parameterized by
$\Delta = \min_t \min_{i \neq i_t^*} \big(p_t(i_t^*) - p_t(i)\big)$, the minimum suboptimality of any suboptimal result.

We summarize the notation in Table 1.

Let S be the set of all contexts which correspond to an event. When the classifier receives a context
x and predicts a "positive", this prediction is called a true positive if $x \in S$, and a false positive
otherwise. Likewise, when the classifier predicts a "negative", the prediction is called a true negative
if $x \notin S$, and a false negative otherwise. The sample $(x, l)$ is correctly labeled if $l = (x \in S)$.

4 Bandit with Classifier 

Our algorithm is called BWC, or "Bandit with Classifier". Ideally, we would like to use a bandit 
algorithm for the event-free setting, such as UCB1, and restart it every time there is an event. Since
we do not have an oracle to tell whether an event has happened, we use a classifier which looks at 
the current context and makes a binary prediction. As we mentioned in the introduction, we assume 
that a priori there are no labeled samples to train such a classifier, so we need to generate the labels 
ourselves. The high-level idea is to restart the bandit algorithm every time the classifier predicts an 
event, and use subsequent rounds to generate feedback (labeled samples) to train the classifier. Thus, 
we have a feedback loop between the bandit algorithm and a classifier, in which the latter provides
predictions and the former verifies whether they are correct (see Figure 1).




Figure 1: A high-level picture of the operation of the BWC algorithm, depicting the feedback loop
between the two main subroutines, bandit and classifier.



So what prevents us from simply combining an off-the-shelf bandit algorithm with an off-the-shelf 
classifier? The central challenge is how to define the feedback. Let us outline several hurdles that we 
need to overcome here. A single false negative prediction will cause BWC to miss an event, which 
may result in a very high regret (since it may take the bandit algorithm a very long time to adjust). 
Incorrectly labeled samples may contaminate the classifier, perhaps even permanently. To generate a 
label for a given sample, one needs to compare the state right before the current round with the state 
right after, in a conclusive way. Neither state is known to the algorithm a priori; both can only be
learned probabilistically via exploration. A particular challenge is to ensure that such exploration is 
not contaminated by events in the past rounds, as well as by events that happen soon after the current 
round. Moreover, this exploration is generally too expensive to perform upon negative predictions 
— indeed, the whole point of BWC is that in the absence of an event the bandit algorithm converges 



to the best arm and (essentially) keeps playing it — so the classifier receives labels only upon the 
positive predictions. 

4.1 The meta-algorithm 

We will present our algorithm in a modular way, as a meta-algorithm which uses the following two 
components: classifier and bandit. In each round, classifier inputs a context $x_t$ and
outputs a "positive" or "negative" prediction of whether an event has happened in this round. Also,
it may input labeled samples of the form $(x, l)$, where x is a context and l is a boolean label, which
it uses for training. Algorithm bandit is a bandit algorithm that is tuned for the event-free runs.

As described above, we further require bandit to provide feedback to the classifier about whether 
the best result has actually changed. The standard bandit framework does not immediately provide 
us with estimates from which such feedback can be obtained. Therefore we require bandit to 
provide the following additional functionality: after each round t of execution, it outputs a pair 
$(G^+, G^-)$ of subsets of arms; we call this pair the t-th round guess. The meaning of $G^+$ and
$G^-$ is that they are the algorithm's estimates for, respectively, the sets of all optimal and (at least)
$\epsilon_S$-suboptimal arms. We use $(G^+, G^-)$ to predict whether an event has happened between two runs of
bandit. The idea is that any such event causes some arm from $G^+$ of the first run to migrate to
$G^-$ of the second run. Accordingly, we generate a negative label if $G_i^+ \cap G_j^- = \emptyset$, where i and j
refer to the first and the second run, respectively (see Line 10 of Algorithm 1).

We formalize our assumptions on classifier and bandit as follows: 

Definition 1. classifier is safe for a given concept class if, given only correctly labeled
samples, it never outputs a false negative. bandit is called $(L, \epsilon)$-testable, for some $L \in \mathbb{N}$ and
$\epsilon \in (0, 1)$, if the following holds. Consider an event-free run of bandit, and let $(G^+, G^-)$ be its
L-th round guess. Then with probability at least $1 - T^{-2}$, each optimal arm lies in $G^+$ but not in
$G^-$, and any arm that is at least $\epsilon$-suboptimal lies in $G^-$ but not in $G^+$.

We will discuss efficient implementations of a safe classifier and a $(L, \epsilon_S)$-testable bandit
in Section 5 and Section 6, respectively. For bandit, we build on the standard algorithm UCB1 [2];
as it turns out, making it $(L, \epsilon_S)$-testable requires a significantly extended analysis.

For correctness, we require bandit to be $(L, \epsilon_S)$-testable, where $\epsilon_S$ is the minimum shift. The
performance of bandit is quantified via its event-free regret, i.e. regret on the event-free runs. 
Likewise, for correctness we need classifier to be safe, and we quantify its performance via the 
following property, termed FP-complexity, which refers to the maximum number of false positives. 

Definition 2. Given a concept class $\mathcal{F}$, the FP-complexity of classifier is the maximum
possible number of false positives it can make in an online prediction game where in each round, an
adversary selects a sample, classifier makes a prediction, and then (in some rounds) receives
a correct label. Specifically, classifier receives a correct label if and only if the prediction is
a false positive. The maximum is taken over all event oracles $f \in \mathcal{F}$ and all possible sequences of
samples.

Now we are ready to present our meta-algorithm, called BWC. It runs in phases of two alternating
types: odd phases are called "testing" phases, and even phases are called "adapting" phases. The first
round of phase j is denoted $t_j$. In each phase we run a fresh instance of bandit. Each testing phase
lasts for L rounds, where L is a parameter. Each adapting phase j ends as soon as classifier
predicts "positive"; the round t when this happens is round $t_{j+1}$. Phase j is called full if it lasts at
least L rounds. For a full phase j, let $(G_j^+, G_j^-)$ be the L-th round guess in this phase. After each
testing phase j, we generate a boolean prediction $l$ of whether there was an event in the first round
thereof. Specifically, letting i be the most recent full phase before phase j, we set $l_{t_j}$ = false if
and only if $G_i^+ \cap G_j^- = \emptyset$. If $l_{t_j}$ is false, the labeled sample $(x_{t_j}, l_{t_j})$ is fed back to the classifier.



^Following established convention, we call the options available to a bandit algorithm "arms". In our setting,
each arm corresponds to a search result.

^Both classifier and bandit make predictions (about events and arms, respectively). For clarity, we
will use the term "guess" exclusively to refer to predictions made by bandit, and reserve the term "prediction"
for classifier.

^Recall that T here is the overall time horizon, as defined in Section 3.



Algorithm 1 Meta-algorithm BWC ("Bandit with Classifier")

Given: Parameter L, a $(L, \epsilon_S)$-testable bandit, and a safe classifier.
for phase j = 1, 2, ... do
    Initialize bandit. Let $t_j$ be the current round.
    if j is odd then
        {testing phase}
        for round t = $t_j$, ..., $t_j + L$ do
            Select arm $i_t$ according to bandit.
            Observe click w.p. $p_t(i_t)$ and update bandit.
        Let i be the most recent full phase before phase j.
        if $G_i^+ \cap G_j^- = \emptyset$ then {label is false}
            Let $l_{t_j}$ = false and pass training example $(x_{t_j}, l_{t_j})$ to classifier.
    else
        {adapting phase}
        for round t = $t_j$, $t_j + 1$, ... do
            Select arm $i_t$ according to bandit.
            Observe click w.p. $p_t(i_t)$ and update bandit; pass context $x_t$ to classifier.
            if classifier predicts "positive" then
                Terminate inner for loop.
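
The control flow of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the interfaces `select`/`update`/`guess` and `predict`/`train`, the helper classes `ToyBandit` and `ThresholdClassifier`, and the deterministic click probabilities are all hypothetical and chosen for illustration. In particular, `ToyBandit` is a simple stand-in and is not claimed to be testable in the sense of Definition 1. Following the text, the round in which the classifier predicts "positive" is treated as the first round of the next testing phase.

```python
import random

class ToyBandit:
    """Hypothetical stand-in: round-robin exploration, then play the empirical best."""
    def __init__(self, n, explore=20, gap=0.2):
        self.n, self.explore, self.gap = n, explore, gap
        self.pulls, self.clicks = [0] * n, [0.0] * n
    def _means(self):
        return [c / p if p else 0.0 for c, p in zip(self.clicks, self.pulls)]
    def select(self):
        total = sum(self.pulls)
        if total < self.explore * self.n:
            return total % self.n
        means = self._means()
        return means.index(max(means))
    def update(self, arm, click):
        self.pulls[arm] += 1
        self.clicks[arm] += click
    def guess(self):
        means = self._means()
        best = max(means)
        g_plus = {i for i, m in enumerate(means) if m == best}
        g_minus = {i for i, m in enumerate(means) if m < best - self.gap}
        return g_plus, g_minus

class ThresholdClassifier:
    """Hypothetical classifier on scalar contexts: positive iff x exceeds a threshold."""
    def __init__(self, threshold):
        self.threshold = threshold
    def predict(self, x):
        return x > self.threshold
    def train(self, x, label):
        if not label:  # a false-labeled sample rules out thresholds below x
            self.threshold = max(self.threshold, x)

def run_bwc(make_bandit, classifier, contexts, click_probs, L, seed=0):
    rng = random.Random(seed)
    T, shown = len(contexts), []
    prev_full_guess = None  # L-th round guess (G+, G-) of the most recent full phase
    t, phase = 0, 1
    while t < T:
        bandit = make_bandit()  # fresh instance in every phase
        testing = (phase % 2 == 1)
        start, rounds, guess = t, 0, None
        while t < T:
            if not testing and classifier.predict(contexts[t]):
                break  # round t becomes the first round of the next testing phase
            arm = bandit.select()
            click = 1.0 if rng.random() < click_probs[t][arm] else 0.0
            bandit.update(arm, click)
            shown.append(arm)
            t += 1
            rounds += 1
            if rounds == L:
                guess = bandit.guess()
                if testing:
                    break
        if testing and guess is not None and prev_full_guess is not None:
            if not (prev_full_guess[0] & guess[1]):  # G_i^+ and G_j^- are disjoint
                classifier.train(contexts[start], False)
        if guess is not None:
            prev_full_guess = guess
        phase += 1
    return shown

# Toy run: a false alarm (context 0.7) at round 100 and a real event at round 200.
contexts = [0.0] * 400
contexts[100], contexts[200] = 0.7, 1.0
click_probs = [[1.0, 0.0]] * 200 + [[0.0, 1.0]] * 200
clf = ThresholdClassifier(threshold=0.5)
shown = run_bwc(lambda: ToyBandit(2), clf, contexts, click_probs, L=40)
```

In this toy run the false alarm at round 100 produces a false-labeled sample, which raises the classifier's threshold to 0.7, while the swap at round 200 still triggers a restart, so the last displayed result is the new optimum.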



Note that classifier never receives true-labeled samples. Pseudocode for BWC is given in
Algorithm 1.

Disregarding the interleaved testing phases for the moment, BWC restarts bandit whenever 
classifier predicts "positive", optimistically assuming that the prediction is correct. By our 
assumption that events cause some optimal arm to become significantly suboptimal (see Section 3),
a correct prediction should result in $G_i^+ \cap G_j^- \neq \emptyset$, where i is a phase before the putative event,
and j is a phase after it. We use this condition in Line 10 of the pseudocode to generate the label.
However, to ensure that the estimates $G_i$ and $G_j$ are reliable, we require that phases i and j are full.
And to ensure that the full phases closest to a putative event are not too far from it, we interleave a 
full testing phase every other phase. 

4.2 Provable guarantees 

We present provable guarantees for BWC in a modular way, in terms of FP-complexity, event-free 
regret, and the number of events. This is the main technical result in the paper.

Theorem 1. Consider an instance of the eventful bandit problem with number of rounds T, n arms,
k events and minimum shift $\epsilon_S$; assume that any two events are at least 2L rounds apart. Consider
algorithm BWC with parameter L and components classifier and bandit that are, respectively,
safe and $(L, \epsilon_S)$-testable. Suppose the event-free regret of bandit is bounded from above
by a concave function $R_0(\cdot)$. Then the regret of BWC is

$$R_{\mathrm{BWC}}(T) \le (2k + d_{FP})\, R_0\!\left(\frac{T}{2k + d_{FP}}\right) + (k + d_{FP})\, R_0(L) + kL, \tag{1}$$

where $d_{FP}$ is the FP-complexity of classifier.
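
As a rough numerical illustration of bound (1), the sketch below evaluates its three summands for a given event-free regret function. It assumes the first summand has the form $(2k + d_{FP}) R_0(T / (2k + d_{FP}))$, as the concavity argument in the remarks suggests. The function name `bwc_regret_bound` and the choice $R_0(t) = (n/\Delta) \log(1 + t)$ with the constants below are hypothetical and serve only as an example of a concave, logarithmic regret curve.

```python
import math

def bwc_regret_bound(T, k, d_fp, L, R0):
    """Evaluate the right-hand side of bound (1) for an event-free regret bound R0."""
    m = 2 * k + d_fp                  # bound on the number of adapting phases
    adapting = m * R0(T / m)          # adapting phases, combined via concavity
    clean_testing = (k + d_fp) * R0(L)
    eventful_testing = k * L
    return adapting + clean_testing + eventful_testing

# Hypothetical parameters: n = 10 results, minimum suboptimality 0.1.
n, delta = 10, 0.1
r0 = lambda t: (n / delta) * math.log(1.0 + t)
bound = bwc_regret_bound(T=10**6, k=5, d_fp=3, L=1000, R0=r0)
```

With a logarithmic $R_0$, the bound grows only logarithmically in T for fixed k, $d_{FP}$ and L, which is the behavior the theorem is meant to capture.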

Remark. We define a safe classifier whose FP-complexity is bounded in terms of some properties
of the underlying concept class (see Section 5). Our instantiation of bandit is $(L, \epsilon_S)$-testable for
L = Θ(^ log T), with concave event-free regret matching that of UCB1 (see Section 6).

Remark. The right-hand side of (1) can be parsed as follows. The three summands in (1) correspond
to contributions of, respectively, adapting phases, event-free testing phases, and testing phases during
which an event has occurred. For the first summand, we show that BWC incurs regret $R_0(t)$ for
each adapting phase of length t, bound the number of adapting phases by $2k + d_{FP}$, and then bound
the total contribution of all such phases using concavity. For the second summand, we bound the
number of clean (event-free) testing phases by $k + d_{FP}$, and note that each such phase contributes at most $R_0(L)$
to regret. For the third summand, each "eventful" testing phase contributes at most L to regret, and
we show that there can be at most k such phases.
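
The concavity step for the first summand can be spelled out as follows. This is a sketch under two extra assumptions that the remark does not state explicitly: $R_0$ is nondecreasing and $R_0(0) \ge 0$. Here $m$ and $t_1, \dots, t_m$ are names introduced for illustration, denoting the number of adapting phases and their lengths.

```latex
% Adapting phases have lengths t_1, ..., t_m with t_1 + ... + t_m <= T and m <= 2k + d_{FP}.
\begin{align*}
\sum_{j=1}^{m} R_0(t_j)
  &\le m \, R_0\!\Big(\tfrac{1}{m} \sum_{j=1}^{m} t_j\Big)
     && \text{(Jensen's inequality, $R_0$ concave)} \\
  &\le m \, R_0\!\Big(\tfrac{T}{m}\Big)
     && \text{($R_0$ nondecreasing, $\textstyle\sum_j t_j \le T$)} \\
  &\le (2k + d_{FP}) \, R_0\!\Big(\tfrac{T}{2k + d_{FP}}\Big)
     && \text{($x \mapsto x\,R_0(T/x)$ is nondecreasing).}
\end{align*}
```

The last step uses that $R_0(s)/s$ is nonincreasing in $s$ for a concave $R_0$ with $R_0(0) \ge 0$, so $x \, R_0(T/x) = T \cdot R_0(T/x) / (T/x)$ can only grow as $x$ grows.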

Remark. Assuming that any two events are at least 2L rounds apart ensures that of any two consecutive
phases, one must be event-free. This, in turn, lets us invoke the $(L, \epsilon_S)$-testability of bandit.

Overview of the proof. The essential difficulty in the analysis of BWC is that an event might happen
while the algorithm is testing for another (suspected) event. The corresponding technical difficulty
is that the correct operation of the components of BWC (classifier and bandit) is
interdependent, so one needs to be careful to avoid a circular argument. In particular, one challenge
is to handle events that occur during the first L rounds of a phase; these events may potentially
"contaminate" the L-th round guesses and cause incorrect feedback to classifier.

First, we argue away the probabilistic nature of the problem. We focus on a given testing phase j. 
For each of the two preceding phases $i \in \{j-1, j-2\}$, consider the number N of events between
the first round of phase i and the first round of phase j. We would like to establish the following
separation property: that we can separate (tell apart) the cases of N = 0 and N = 1 using the testing
condition in Line 10 of the pseudocode. Capitalizing on $(L, \epsilon_S)$-testability, we define a technical
condition which implies the separation property with very high probability. Regret incurred if the
implication "technical condition => separation property" fails to hold is negligible. Thus we can
assume that this implication holds always, and argue deterministically from now on. 

It is worth noting that we consider two preceding phases $i \in \{j-1, j-2\}$ because either can be
used in Line 10 of the pseudocode (depending on whether phase $j - 1$ is full). A crucial point
here is that one of these two phases must be event-free. 

Second, we argue that classifier receives only correctly labeled samples. We do it in two 
steps. Using the well-detectable property, we show that if classifier receives an incorrectly 
labeled sample after some testing phase j, then an event must have occurred during the (adapting) 
phase $j - 1$. Then using the safety property of classifier, we prove that each adapting phase is
event-free. 

Third, we bound from above the number of testing and adapting phases, using the maximal number
of events and the FP-complexity of the classifier. To this end, we establish that if during a testing
phase j there are no events, and furthermore there are no events during the two preceding phases,
then at the end of j BWC generates a correct label $l$ = false. Then the regret bound (1) follows
easily from the event-free regret of bandit. □

Now let us present the full proof which fills the gaps in the above overview. 

Full Proof. Let $t_j$ be the first round of phase j. Recall that phase j is called full if it lasts at least
L rounds. For a full phase j, let us say that the phase is event-free if no events happened during
interval $(t_j, t_j + L]$, and let $(G_j^+, G_j^-)$ be the L-th round guess in this phase. For two full phases
$i < j$, let us write $i \otimes j$ if and only if $G_i^+ \cap G_j^- = \emptyset$. Recall that $i \otimes j$ (as a boolean property) is our
algorithm's estimate of whether there was no event in round $t_j$.

A testing phase j is called well-detectable if for each phase $i \in \{j-2, j-1\}$ the following property
holds: if phases i and j are full and event-free, then: (i) if there are no events in the interval $(t_i, t_j]$
then $i \otimes j$, (ii) if in the interval $(t_i, t_j]$ there is exactly one event, then $\neg(i \otimes j)$. Since bandit is
$(L, \epsilon_S)$-testable, each testing phase j is well-detectable with probability at least $1 - 2T^{-2}$. Thus,
with probability at least $1 - O(T^{-1})$ each testing phase is well-detectable. Thus, regret incurred in
the case that a phase fails to be well-detectable is negligible. So in the rest of the proof, we will
assume that each testing phase is well-detectable.

We claim that if classifier receives an incorrectly labeled sample after some testing phase j, 
then an event must have occurred during the (adapting) phase j — 1. Indeed, by the algorithm 
specification this sample is $(x_{t_j}, \text{false})$, where $t_j$ is the first round of phase j. Thus, an event has
happened in round $t_j$, and yet we have $i \otimes j$, where i is the most recent full phase before phase j.



^In fact, the k in the +kL term in (1) can be replaced by the (potentially much smaller) number of testing
phases that contain both a false positive in round 1 of the phase and an actual event later in the phase.

^In the full proof, this implication is called well-detectability.



Since each testing phase is well-detectable, it follows that at least one more event happened between 
the beginning of phase i and the end of phase j. Since any two events are at least 2L rounds apart, 
phase i started at some round $t_i < t_j - 2L$, and an event has happened in the interval $[t_i, t_j - 2L)$.
To prove the claim, it suffices to show that $i = j - 1$. Now, if phase $j - 1$ lasted less than L steps,
then $i = j - 2$ is a testing phase, and so $t_i > t_j - 2L$, contradiction. Thus phase $j - 1$ lasted at least
L steps, and so $i = j - 1$, claim proved.

We claim that all adapting phases are event-free. For the sake of contradiction, suppose an event 
occurs during an adapting phase, and let t be the first round at which this happens. We know that 
classifier output a (false) negative in this round, since otherwise a new testing phase would 
have started at round t. Since classifier is safe, at some round before t it must have received 
an incorrectly labeled sample. By the algorithm specification, this must have happened after some 
testing phase j which ended before round t. But then (by the previous claim) an event must have 
occurred during the (adapting) phase j — 1, which contradicts the choice of t. Claim proved. 

From the previous two claims, it follows that classifier receives only correctly labeled samples.

We claim that if there are no events during some testing phases $j - 2$ and j, then at the end of phase
j we generate a label $l$ = false. Indeed, suppose not. Then $\neg(i \otimes j)$, where i is the most recent
full phase before phase j. Either $i = j - 2$ or $i = j - 1$; in either case, $\neg(i \otimes j)$ implies that there
is an event in the interval $[t_i, t_j)$. Since there are no events during adapting phases, it follows that
$i = j - 2$, contradiction. Claim proved.

We claim that there can be at most $2k + d_{FP}$ testing phases (and hence at most as many adapting
phases), including at most $k + d_{FP}$ event-free testing phases. Indeed, in the first round of each
testing phase j classifier generates a "positive", and at the end of the phase we generate a
label $l \in \{\text{true}, \text{false}\}$. We examine each case separately: (i) if $l$ = false then classifier
receives feedback, so there can be at most $d_{FP}$ such phases j, (ii) if $l$ = true then an event has
occurred in phase j or $j - 2$, so there can be at most 2k such phases j, of which at most k phases
can be event-free. Claim proved.

To obtain the regret bound (1), note that regret in each event-free phase of length t is at most $R_0(t)$; see the
second remark after Theorem 1 for details. □



5 Safe Classifier 

In this section, we show how safe classifiers with low FP-complexity can be constructed for specific 
concept classes. Recall that a classifier is called safe if (assuming it inputs only correctly labeled 
samples) it never outputs a false negative, and the definition of FP-complexity, motivated by the 
specification of the BWC algorithm, essentially assumes that all labeled samples correspond to false 
positives. 

We first describe a generic classifier, called SafeCl, that is safe for any concept class $\mathcal{F}$, and
bound its FP-complexity using a certain property of $\mathcal{F}$. In the event that the concept class is all
d-dimensional axis-parallel hyper-rectangles with margin l/S, we show that this bound is proportional
to d/6. And in the event that the concept class is all d-dimensional hyperplanes with margin S, we 
show that this bound is exponential in d. Unfortunately, the exponential dependence cannot be 
improved, as we will see in Section]?] 

The classifier SafeCl is defined as follows. 



SafeCl classifies a given unlabeled context $x$ as negative if and only if there exists no concept $f \in \mathcal{F}$ such that $f(x) = +1$ and $f(x') = -1$ for each false-labeled example $x'$ received so far.



It is easy to see that this classifier is indeed safe. Moreover, we bound its FP-complexity in terms of the following property of the concept class $\mathcal{F}$:

Definition 3. The diameter of $\mathcal{F}$, denoted $d_{\mathcal{F}}$, is equal to the length of the longest sequence $x_1, \ldots, x_m \in \mathcal{X}$ such that for each $t = 1, \ldots, m$ there exists a concept $f \in \mathcal{F}$ with the following property: $f(x_t) = +1$, and $f(x_s) = -1$ for all $s < t$.

Claim 1. SafeCl is safe, and its FP-complexity is at most $d_{\mathcal{F}}$.



Proof. Assume all false-labeled examples input by SafeCl are correctly labeled. Suppose SafeCl outputs a false negative, with concept $f \in \mathcal{F}$ and unlabeled sample $x$. Then $f(x) = +1$ and $f(x') = -1$ for each false-labeled example $x'$ received so far. But by definition of SafeCl such a concept does not exist, contradiction. Therefore, SafeCl is safe. Regarding the FP-complexity, consider the prediction game in the definition of FP-complexity. Any sequence $x_1, \ldots, x_m$ of false positives output by SafeCl satisfies the property in Definition 3, so $m \le d_{\mathcal{F}}$. □
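
The definition of SafeCl can be illustrated with a minimal sketch for the special case of a finite, explicitly enumerated concept class. The class name `FiniteSafeCl`, the helper `threshold`, and the one-dimensional threshold concepts are assumptions made purely for illustration (they are not part of the paper), and `None` stands in for the null value.

```python
from typing import Callable, Iterable, List, Optional

# A concept maps a context to +1, -1, or None (None stands in for "null").
Concept = Callable[[float], Optional[int]]


class FiniteSafeCl:
    """SafeCl over an explicitly enumerated (finite) concept class."""

    def __init__(self, concepts: Iterable[Concept]):
        self.concepts: List[Concept] = list(concepts)
        self.false_examples: List[float] = []  # contexts labeled false so far

    def add_false_example(self, x: float) -> None:
        self.false_examples.append(x)

    def predict(self, x: float) -> int:
        """Return -1 only if no concept is consistent with x being positive."""
        for f in self.concepts:
            if f(x) == +1 and all(f(xp) == -1 for xp in self.false_examples):
                return +1
        return -1


def threshold(theta: float) -> Concept:
    # Hypothetical 1-d concept: positive iff x >= theta (no null region).
    return lambda x: +1 if x >= theta else -1


clf = FiniteSafeCl(threshold(th / 10) for th in range(1, 10))
first = clf.predict(0.35)     # some threshold <= 0.35 is still possible
clf.add_false_example(0.55)   # rules out every threshold <= 0.55
second = clf.predict(0.35)    # no surviving concept labels 0.35 positive
third = clf.predict(0.95)     # thresholds in (0.55, 0.95] survive
```

In this toy class, each false-labeled example can only shrink the set of surviving concepts, which is why the classifier never has to retract a negative prediction.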

By using SafeCl as our classifier, we introduce $d_{\mathcal{F}}$ into the regret bound of BWC, and this quantity can be large. However, in Section 7 we show that the regret of any algorithm must depend on $d_{\mathcal{F}}$, unless it depends strongly on the number of rounds $T$.

Below we give examples of common concept classes with efficiently computable safe functions, and prove bounds on their diameter. Recall that for a given universe $\mathcal{X}$ of examples, a concept is a function $f : \mathcal{X} \to \{-1, +1, \text{null}\}$, where the null value refers to the examples that are not feasible under a given concept (i.e., if $f$ is the true concept, then we will never observe an example $x$ such that $f(x) = \text{null}$).

In what follows, for each $N \subset \mathcal{X}$ define $S_{\mathcal{F}}(N) \subset \mathcal{X}$ as the set of all $x \in \mathcal{X}$ for which there is no concept $f \in \mathcal{F}$ such that $f(x) = +1$ and $f(x') = -1$ for each $x' \in N$. Note that SafeCl outputs a negative prediction on $x$ if and only if $x \in S_{\mathcal{F}}(N)$, where $N$ is the set of false-labeled samples received so far. Likewise, in Definition 3 the sequence $\{x_t\}$ satisfies $x_t \notin S_{\mathcal{F}}(\{x_1, \ldots, x_{t-1}\})$ for each $t$.

For convenience, define a "$\delta$-ball" around a set $S \subset \mathbb{R}^d$ in the $d$-dimensional $L_p$-norm as

$$B_p^d(S, \delta) = \{x \in \mathbb{R}^d : L_p(x, S) < \delta\}, \quad \text{where } L_p(x, S) = \min_{y \in S} \|x - y\|_p.$$

Here $L_p(x, S)$ is the $L_p$-norm distance between a point $x$ and a set $S$.

5.1 Axis-parallel rectangles with margin $\delta$

One very simple concept is an axis-parallel hyper-rectangle. This type of concept can be used to 
test whether any one of several features is outside of its 'normal' range. This is a particularly well- 
suited concept class for predicting events that may affect a search engine query, since these events 
are typically preceded by a large change in some statistic related to the query, such as its volume or 
abandonment rate. 

Fix the dimension $d$, and let $\mathcal{X} \subset \mathbb{R}^d$ be the $d$-dimensional $L_\infty$-norm unit ball around the origin. A $d$-rectangle in $\mathbb{R}^d$ is the cross-product of $d$ non-empty intervals in $\mathbb{R}$. Given $\delta > 0$ and a $d$-rectangle $R$, define a function $f_{R,\delta} : \mathcal{X} \to \{-1, +1, \text{null}\}$ as follows: $f_{R,\delta}(x)$ equals $+1$ if $L_\infty(x, R) \ge \delta$; it equals $-1$ if $x \in R$, and it equals null otherwise (note that the margin $\delta$ applies only outside of $R$). The concept class of $d$-dimensional axis-parallel rectangles with margin $\delta$ is defined as $\mathcal{F}_{\mathrm{APR}}(d, \delta) = \{f_{R,\delta} : \text{all } d\text{-rectangles } R\}$.

We bound the diameter of $\mathcal{F}_{\mathrm{APR}}(d, \delta)$ as follows.
Claim 2. If $\mathcal{F} = \mathcal{F}_{\mathrm{APR}}(d, \delta)$, then $d_{\mathcal{F}} \le O(d/\delta)$.

Proof. Consider a sequence $x_1, \ldots, x_m \in \mathcal{X}$ such that $x_t \notin S_{\mathcal{F}}(\{x_1, \ldots, x_{t-1}\})$ for all $1 \le t \le m$. Let $R_t$ be the $\delta$-ball in $L_\infty$ around the smallest $d$-rectangle containing $x_1, \ldots, x_t$. By definition of the sequence, at least one of the one-dimensional intervals defining $R_{t+1}$ must be $\delta$ larger than the same interval in $R_t$. Since $\|x_t\|_\infty \le 1$, $m \le O(d/\delta)$. □

Clearly, for the concept class $\mathcal{F}_{\mathrm{APR}}(d, \delta)$, the classifier SafeCl simply maintains the smallest $d$-dimensional rectangle $R(N)$ containing the set $N$ of all previously false-labeled examples, and classifies a new example $x$ as negative if and only if $x$ lies within $\delta$ (measured in $L_\infty$-norm) of $R(N)$. In other words,



SafeCl on $\mathcal{F}_{\mathrm{APR}}(d, \delta)$: classify $x \in \mathcal{X}$ as negative $\iff$ $x \in B_\infty^d(R(N), \delta)$,
where $N$ is the set of all false-labeled examples received so far.
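
The bounding-box rule above is simple enough to sketch directly. The following is an illustrative sketch only: the class name `RectangleSafeCl` and its methods are hypothetical, and it assumes that "within $\delta$" means $L_\infty$ distance strictly less than $\delta$ and that an empty set of false-labeled examples rules nothing out.

```python
from typing import List, Optional, Sequence


class RectangleSafeCl:
    """SafeCl for axis-parallel rectangles with margin delta (illustrative sketch)."""

    def __init__(self, delta: float):
        self.delta = delta
        self.lo: Optional[List[float]] = None  # lower corner of R(N)
        self.hi: Optional[List[float]] = None  # upper corner of R(N)

    def add_false_example(self, x: Sequence[float]) -> None:
        """Grow the bounding box R(N) so that it contains x."""
        if self.lo is None or self.hi is None:
            self.lo, self.hi = list(x), list(x)
        else:
            self.lo = [min(a, b) for a, b in zip(self.lo, x)]
            self.hi = [max(a, b) for a, b in zip(self.hi, x)]

    def linf_distance(self, x: Sequence[float]) -> float:
        """L-infinity distance from x to the box: the largest per-coordinate overshoot."""
        return max(max(l - xi, xi - h, 0.0) for xi, l, h in zip(x, self.lo, self.hi))

    def predict(self, x: Sequence[float]) -> int:
        if self.lo is None:
            return +1  # assumption: with no false examples, nothing rules out a positive
        return -1 if self.linf_distance(x) < self.delta else +1


clf = RectangleSafeCl(delta=0.1)
before = clf.predict((0.25, 0.05))   # no false examples yet
clf.add_false_example((0.0, 0.0))
clf.add_false_example((0.2, 0.1))    # R(N) is now [0, 0.2] x [0, 0.1]
near = clf.predict((0.25, 0.05))     # overshoot 0.05 < delta
far = clf.predict((0.5, 0.0))        # overshoot 0.3 >= delta
```

Both the update and the prediction cost $O(d)$ time per context, which is what makes this concept class attractive in practice.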



5.2 Hyperplanes with margin $\delta$

Hyperplanes are perhaps the most widely-used concept in classification problems. Fix the dimension $d$, and let $\mathcal{X} \subset \mathbb{R}^d$ be the $d$-dimensional $L_2$-norm unit ball around the origin. Given $u, w \in \mathbb{R}^d$ and $\delta > 0$, define a function $f_{u,w,\delta} : \mathcal{X} \to \{-1, +1, \text{null}\}$ as follows: $f_{u,w,\delta}(x)$ equals $+1$ if $w \cdot (x + u) \ge \delta$, it equals $-1$ if $w \cdot (x + u) \le -\delta$, and it equals null otherwise. Here $w$ is the unit normal of the hyperplane, and $u$ is the shift vector. The concept class of $d$-dimensional hyperplanes with margin $\delta$ is defined as

$$\mathcal{F}_{\mathrm{HYP}}(d, \delta) = \{f_{u,w,\delta} : u, w \in \mathbb{R}^d,\ \|u\|_2 \le 1,\ \|w\|_2 = 1\}.$$

We bound the diameter of $\mathcal{F}_{\mathrm{HYP}}(d, \delta)$ as follows:
Claim 3. If $\mathcal{F} = \mathcal{F}_{\mathrm{HYP}}(d, \delta)$, then $d_{\mathcal{F}} \le (1 + 1/\delta)^d$.

Proof. Consider a sequence $x_1, \ldots, x_m \in \mathcal{X}$ such that $x_t \notin S_{\mathcal{F}}(\{x_1, \ldots, x_{t-1}\})$ for all $1 \le t \le m$, as in Definition 3. Then for each $s$ and $t$ such that $1 \le s < t \le m$ there exist $u, w \in \mathbb{R}^d$ such that $\|u\|_2 \le 1$, $\|w\|_2 = 1$, $w \cdot (x_t + u) \ge \delta$ and $w \cdot (x_s + u) \le -\delta$. By Hölder's inequality, it follows that

$$\|x_t - x_s\|_2 = \|w\|_2 \, \|x_t - x_s\|_2 \ge w \cdot (x_t - x_s) \ge 2\delta. \tag{2}$$

Now, place an $L_2$-ball of radius $\delta$ around each point $x_t$. By (2), none of these balls can intersect. A radius-$r$ ball in $d$ dimensions has volume $C_d r^d$, where $C_d$ is a constant that depends only on $d$. Thus the total volume of the balls is $m C_d \delta^d$. On the other hand, $\|x_t\|_2 \le 1$ for each $t$, so each of these balls lies in the radius-$(1 + \delta)$ ball around the origin, so their total volume is at most $C_d (1 + \delta)^d$. It follows that $m \le (1 + 1/\delta)^d$. □

We now show that there is a computationally efficient way to implement the classifier SafeCl for hypothesis class $\mathcal{F}_{\mathrm{HYP}}(d, \delta)$. Specifically, we show that the classifier SafeCl simply maintains the convex hull $\mathrm{Co}(N)$ of all previously false-labeled examples $N$, and classifies a new example $x$ as negative if and only if $x$ lies within $2\delta$ (measured in $L_2$-norm) of $\mathrm{Co}(N)$. In other words,



SafeCl on $\mathcal{F}_{\mathrm{HYP}}(d, \delta)$: classify $x \in \mathcal{X}$ as negative $\iff$ $x \in B_2^d(\mathrm{Co}(N), 2\delta)$,
where $N$ is the set of all false-labeled examples received so far.
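
One way to realize the convex-hull rule in code is sketched below. This is not the paper's implementation: it assumes the distance to $\mathrm{Co}(N)$ is approximated by a plain Frank-Wolfe iteration with exact line search, and the names `hull_distance` and `HyperplaneSafeCl` are hypothetical. Since every iterate lies inside the hull, the returned value can only overestimate the true distance, so approximation error can cause extra positives but never a false negative.

```python
from typing import List, Sequence

Vec = Sequence[float]


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def hull_distance(x: Vec, points: List[Vec], iters: int = 1000) -> float:
    """Upper bound on the L2 distance from x to Co(points), via Frank-Wolfe."""
    y = list(points[0])  # current iterate; always a convex combination of points
    for _ in range(iters):
        g = [yi - xi for yi, xi in zip(y, x)]        # gradient of 0.5 * ||y - x||^2
        s = min(points, key=lambda p: dot(g, p))     # vertex minimizing the linearization
        d = [si - yi for si, yi in zip(s, y)]
        gap, dd = -dot(g, d), dot(d, d)
        if gap <= 1e-12 or dd == 0.0:                # (near-)optimal: stop
            break
        step = min(1.0, gap / dd)                    # exact line search on segment [y, s]
        y = [yi + step * di for yi, di in zip(y, d)]
    diff = [yi - xi for yi, xi in zip(y, x)]
    return dot(diff, diff) ** 0.5


class HyperplaneSafeCl:
    """SafeCl for hyperplanes with margin delta (illustrative sketch)."""

    def __init__(self, delta: float):
        self.delta = delta
        self.points: List[Vec] = []  # false-labeled examples N

    def add_false_example(self, x: Vec) -> None:
        self.points.append(tuple(x))

    def predict(self, x: Vec) -> int:
        if not self.points:
            return +1  # assumption: an empty N rules nothing out
        return -1 if hull_distance(x, self.points) < 2 * self.delta else +1


clf = HyperplaneSafeCl(delta=0.05)
for p in [(0.0, 0.0), (0.4, 0.0), (0.0, 0.4)]:
    clf.add_false_example(p)
inside = clf.predict((0.1, 0.1))    # lies in the hull, so distance is near 0
outside = clf.predict((0.6, 0.6))   # nearest hull point is (0.2, 0.2)
dist_out = hull_distance((0.6, 0.6), clf.points)
```

An exact quadratic-programming solver could replace the Frank-Wolfe loop; the iteration is used here only to keep the sketch dependency-free.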



Claim 4. If $\mathcal{F} = \mathcal{F}_{\mathrm{HYP}}(d, \delta)$ and $N \subset \mathcal{X}$ then $S_{\mathcal{F}}(N) = \mathcal{X} \cap B_2^d(\mathrm{Co}(N), 2\delta)$, where $\mathrm{Co}(N)$ is the convex hull of $N$.

Proof. Fix $x_t \in \mathcal{X}$. We divide the proof into two parts. First, we show that if $x_t$ is contained in the $2\delta$-ball around $\mathrm{Co}(N)$, then no hyperplane in $\mathcal{F}$ can separate $x_t$ from $N$. Next, we show that if $x_t$ is outside the $2\delta$-ball around $\mathrm{Co}(N)$, then at least one hyperplane in $\mathcal{F}$ separates $x_t$ from $N$. More precisely, we prove that

(i) If $x_t \in B_2^d(\mathrm{Co}(N), 2\delta)$ then there does not exist $f \in \mathcal{F}_{\mathrm{HYP}}(d, \delta)$ such that $f(x_s) = -1$ for all $x_s \in N$ and $f(x_t) = +1$.

(ii) If $x_t \notin B_2^d(\mathrm{Co}(N), 2\delta)$ then there exists $f \in \mathcal{F}_{\mathrm{HYP}}(d, \delta)$ such that $f(x_s) = -1$ for all $x_s \in N$ and $f(x_t) = +1$.

Proof of (i): Suppose for contradiction that there exist $u, w \in \mathbb{R}^d$, with $\|u\|_2 \le 1$ and $\|w\|_2 = 1$, such that $w \cdot (x_s + u) \le -\delta$ for all $x_s \in N$ and $w \cdot (x_t + u) \ge \delta$.

Choose $x^* \in \mathrm{Co}(N)$ so that $\|x_t - x^*\|_2 = L_2(x_t, \mathrm{Co}(N))$, i.e. $x^*$ is a closest point in $\mathrm{Co}(N)$ to $x_t$ (we know $x^*$ exists because $\mathrm{Co}(N)$ is closed). Since $x_t \in B_2^d(\mathrm{Co}(N), 2\delta)$, we have that $\|x_t - x^*\|_2 < 2\delta$.

We know that $w \cdot (x^* + u) \le -\delta$ because $x^*$ is a convex combination of the examples in $N$. Therefore, by the intermediate value theorem, there exist $x' \in \mathcal{X}$ and $\theta \in [0, 1]$ such that $x' = (1 - \theta) x_t + \theta x^*$ and $w \cdot (x' + u) = 0$.

Some algebra shows that $\|x_t - x'\|_2 = \theta \|x_t - x^*\|_2$ and $\|x' - x^*\|_2 = (1 - \theta) \|x_t - x^*\|_2$. Adding these equations yields

$$\|x_t - x'\|_2 + \|x' - x^*\|_2 = \|x_t - x^*\|_2.$$

Because $w \cdot (x' + u) = 0$, by Hölder's inequality we have

$$\|x_t - x'\|_2 = \|w\|_2 \|x_t - x'\|_2 \ge w \cdot (x_t - x') = w \cdot (x_t + u) - w \cdot (x' + u) \ge \delta$$

and

$$\|x' - x^*\|_2 = \|w\|_2 \|x' - x^*\|_2 \ge w \cdot (x' - x^*) = w \cdot (x' + u) - w \cdot (x^* + u) \ge \delta,$$

which implies $\|x_t - x^*\|_2 \ge 2\delta$, which is a contradiction.



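Proof of (ii): We will use the well-known separating hyperplane theorem [21]: If nonempty convex sets $X, Y \subset \mathbb{R}^d$ do not intersect, then there exist $a \in \mathbb{R}^d \setminus \{0\}$ and $b \in \mathbb{R}$ such that

$$a \cdot x \ge b \text{ for all } x \in X \quad \text{and} \quad a \cdot y \le b \text{ for all } y \in Y. \tag{3}$$

Since $x_t \notin B_2^d(\mathrm{Co}(N), 2\delta)$ there must exist $\epsilon > 0$ such that the sets $X = B_2^d(\{x_t\}, \delta)$ and $Y = B_2^d(\mathrm{Co}(N), \delta + \epsilon)$ do not intersect. For these choices of $X$ and $Y$, let us fix $a \in \mathbb{R}^d \setminus \{0\}$ and $b \in \mathbb{R}$ that satisfy (3).

Note that $x_t + z \in X$ for all $z \in \mathbb{R}^d$ such that $\|z\|_2 < \delta$. Also note that $x_s + z \in Y$ for all $x_s \in N$ and $z \in \mathbb{R}^d$ such that $\|z\|_2 < \delta + \epsilon$. So by (3) we have

$$a \cdot \left( x_t - \delta \frac{a}{\|a\|_2} \right) \ge b \quad \text{and} \quad a \cdot \left( x_s + (\delta + \epsilon) \frac{a}{\|a\|_2} \right) \le b \text{ for all } x_s \in N.$$

Letting $w = \frac{a}{\|a\|_2}$ and rearranging we have

$$w \cdot x_t \ge \frac{b}{\|a\|_2} + \delta \quad \text{and} \quad w \cdot x_s \le \frac{b}{\|a\|_2} - (\delta + \epsilon) \text{ for all } x_s \in N. \tag{4}$$

Since $\|w\|_2 = 1$ and $\|x\|_2 \le 1$ for all $x \in \mathcal{X}$, it follows from (4) that $\left| \frac{b}{\|a\|_2} \right| < 1$. Thus there exists $u \in \mathbb{R}^d$ such that $\|u\|_2 \le 1$ and $w \cdot u = -\frac{b}{\|a\|_2}$. It now follows that

$$w \cdot (x_t + u) \ge \delta \quad \text{and} \quad w \cdot (x_s + u) \le -\delta - \epsilon \text{ for all } x_s \in N.$$

So the function $f_{u,w,\delta} \in \mathcal{F}_{\mathrm{HYP}}(d, \delta)$ satisfies the claim. □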

6 Testable Bandit Algorithms 

In this section we will consider the stochastic $n$-armed bandit problem. We are looking for $(L, \epsilon)$-testable algorithms with low regret. The $L$ will need to be sufficiently large, on the order of $\Omega(n \epsilon^{-2})$.

A natural candidate would be algorithm UCB1 from [2], which does very well on event-free regret:

$$R_0(L) \le O\left( \min\left( \tfrac{n}{\Delta} \log L,\ \sqrt{n L \log L} \right) \right). \tag{5}$$

Unfortunately, UCB1 does not immediately provide a way to define the $t$-th round best guess $(G^+, G^-)$ so as to guarantee $(L, \epsilon)$-testability. One simple fix is to choose an arm at random in each of the first $L$ rounds, use these samples to form the best guess in a straightforward way, and then run UCB1. However, in the first $L$ rounds this algorithm incurs regret of $\Omega(L)$, which is very suboptimal compared to $R_0(L)$ from (5).

In this section, we develop an algorithm which has the same regret bound as UCB1, and is $(L, \epsilon)$-testable. We state this result more generally, in terms of estimating expected payoffs; we believe it may be of independent interest. The $(L, \epsilon)$-testability is then an easy corollary.

Since our analysis in this section is for the event-free setting, we can drop the subscript $t$ from much of our notation. Let $p(u)$ denote the (time-invariant) expected payoff of arm $u$. Let $p^* = \max_u p(u)$, and let $\Delta(u) = p^* - p(u)$ be the "suboptimality" of arm $u$. For round $t$, let $\mu_t(u)$ be the sample average of arm $u$, and let $n_t(u)$ be the number of times arm $u$ has been played.
Algorithm 2 The $(L, \epsilon)$-testable bandit algorithm with low regret.



Given: Time horizon $T$, parameter $\epsilon \in (0, 1)$.
for all arms $u$ do
    $n(u) \leftarrow 0$, $x(u) \leftarrow 0$, $\mu(u) \leftarrow 0$ {#samples, total reward, sample average}
for rounds $t = 1, 2, \ldots, T$ do
    Pick arm $u$ with the maximal index $I(u) = \mu(u) + 6 \sqrt{8 \log(T + t) / n(u)}$.
    Observe payoff $x$, update $n(u) \leftarrow n(u) + 1$, $x(u) \leftarrow x(u) + x$, $\mu(u) \leftarrow x(u)/n(u)$.
    { Form the $t$-th round guess }
    $v^* \leftarrow$ arm played most often in the last $t/2$ rounds.
    for all arms $v$ do
        $\Delta(v) \leftarrow \mu(v^*) - \mu(v)$ {the $t$-th round estimate of $\Delta(v)$}
    Output $(G^+, G^-) = (\{v : \Delta(v) < \epsilon/4\}, \{v : \Delta(v) > \epsilon/2\})$



We will use a slightly modified algorithm UCB1 from [2], with a significantly extended analysis. Recall that in each round $t$ algorithm UCB1 chooses an arm $u$ with the highest index $I_t(u) = \mu_t(u) + r_t(u)$, where $r_t(u) = \sqrt{2 \log(t) / n_t(u)}$ is a term that we'll call the confidence radius, whose meaning is that $|p(u) - \mu_t(u)| \le r_t(u)$ with high probability. For our purposes here it is instructive to re-write the index as $I_t(u) = \mu_t(u) + \alpha\, r_t(u)$ for some parameter $\alpha$. Also, to better bound the early failure probability we will re-define the confidence radius as $r_t(u) = \sqrt{8 \log(t_0 + t) / n_t(u)}$ for some parameter $t_0$. We will denote this parameterized version by UCB1$(\alpha, t_0)$.

The original regret analysis of UCB1 in [2] carries over to UCB1$(\alpha, t_0)$ so as to guarantee event-free regret (5); we omit the details.

Our contribution concerns estimating the $\Delta(u)$'s. We estimate the maximal expected reward $p^*$ via the sample average of an arm that has been played most often. More precisely, in order to bound the failure probability we consider an arm that has been played most often in the last $t/2$ rounds. For a given round $t$ let $v_t$ be one such arm (ties broken arbitrarily), and let $\Delta_t(u) = \mu_t(v_t) - \mu_t(u)$ be our estimate of $\Delta(u)$. This estimate (and the provable guarantee thereon) is the main technical contribution of this section.

We obtain an $(L, \epsilon)$-testable algorithm from UCB1$(6, T)$, where $T$ is the time horizon, by defining the $t$-th round guess as

$$(G^+, G^-) = (\{v : \Delta_t(v) < \epsilon/4\},\ \{v : \Delta_t(v) > \epsilon/2\}). \tag{6}$$

The pseudocode is in Algorithm 2.

Let us pass to the provable guarantees. We express the "quality" of our estimate $\Delta_t$ as follows:

Theorem 2. Consider the stochastic $n$-armed bandits problem. Suppose algorithm UCB1$(6, t_0)$ has been played for $t$ steps, and $t + t_0 \ge 32$. Then with probability at least $1 - (t_0 + t)^{-2}$ for any arm $u$ we have

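$$|\Delta(u) - \Delta_t(u)| \le \tfrac{1}{4} \Delta(u) + \delta(t), \tag{7}$$

where $\delta(t) = O\left( \sqrt{\tfrac{n}{t} \log(t + t_0)} \right)$.

Remark. Either we know that $\Delta(u)$ is small, or we can approximate it up to a constant factor. Specifically, if $\delta(t) < \tfrac{1}{2} \Delta_t(u)$ then $\Delta(u) \le 2 \Delta_t(u) \le 5 \Delta(u)$, else $\Delta(u) \le 4 \delta(t)$.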
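
The constants in the remark can be checked by a short calculation. The sketch below assumes that (7) reads $|\Delta(u) - \Delta_t(u)| \le \tfrac14 \Delta(u) + \delta(t)$ and that the case split is at $\delta(t) < \tfrac12 \Delta_t(u)$; under these readings the factors 2, 5 and 4 follow exactly.

```latex
\begin{align*}
\text{(7)} \;&\Longrightarrow\;
  \tfrac34\,\Delta(u) - \delta(t) \;\le\; \Delta_t(u) \;\le\; \tfrac54\,\Delta(u) + \delta(t). \\
\text{If } \delta(t) < \tfrac12 \Delta_t(u):\quad
  & \tfrac34\,\Delta(u) \le \Delta_t(u) + \delta(t) < \tfrac32\,\Delta_t(u)
    \;\Longrightarrow\; \Delta(u) < 2\,\Delta_t(u), \\
  & \tfrac12\,\Delta_t(u) < \Delta_t(u) - \delta(t) \le \tfrac54\,\Delta(u)
    \;\Longrightarrow\; 2\,\Delta_t(u) < 5\,\Delta(u). \\
\text{Otherwise } \delta(t) \ge \tfrac12 \Delta_t(u):\quad
  & \tfrac34\,\Delta(u) \le \Delta_t(u) + \delta(t) \le 3\,\delta(t)
    \;\Longrightarrow\; \Delta(u) \le 4\,\delta(t).
\end{align*}
```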

Proof. Fix round $t$, let $v^* = v_t$ and let $s$ be the last round this arm has been played before round $t$. Recall that $s \ge t/2$ by definition of $v_t$. Since by the pigeonhole principle $n_t(v^*) \ge \frac{t}{2n}$, it follows that $r_t(v^*) \le O(\delta)$ where $\delta = \sqrt{\tfrac{n}{t} \log(t + t_0)}$. It is easy to see that

$$r_s(v^*) \le 2\, r_{s+1}(v^*) \le 2\, r_t(v^*) = O(\delta).$$

Then with probability at least $1 - (t_0 + t)^{-2}$ for any arm $u$ we have

$$p(v^*) + O(\delta) \ge p(v^*) + 7 r_s(v^*) \ge I_s(v^*) \ge I_s(u) \ge p(u) + 5 r_s(u). \tag{8}$$

If $u^*$ is the arm with maximal expected reward, then plugging $u = u^*$ into (8) gives $\Delta(v^*) \le O(\delta)$.

We claim that (8) implies $r_t(u) \le \tfrac{1}{4} \Delta(u) + O(\delta)$. Indeed, we can re-write (8) as

$$5 r_s(u) \le p(v^*) - p(u) + O(\delta) \le \Delta(u) + O(\delta).$$

The claim follows since $r_t(u) \le r_s(u) \log(t_0 + t) / \log(t_0 + s) \le \tfrac{5}{4} r_s(u)$.

Now we are ready for the final calculation. Let $p^*$ be the maximal expected reward. Then

$$\begin{aligned}
|\Delta(u) - \Delta_t(u)| &= |p^* - p(u) - \mu_t(v^*) + \mu_t(u)| \\
&= |(p^* - p(v^*)) + (p(v^*) - \mu_t(v^*)) + (\mu_t(u) - p(u))| \\
&\le \Delta(v^*) + |p(v^*) - \mu_t(v^*)| + |\mu_t(u) - p(u)| \\
&\le \Delta(v^*) + r_t(v^*) + r_t(u) \le \tfrac{1}{4} \Delta(u) + O(\delta). \qquad \Box
\end{aligned}$$

Finally, let us prove that Algorithm 2 is $(L, \epsilon)$-testable as long as $L \ge \Omega(\tfrac{n}{\epsilon^2} \log T)$.

Theorem 3. Consider algorithm UCB1$(6, T)$ where $T$ is the time horizon and the $t$-th round guess is given by (6). Assume that $\delta(L) \le \epsilon/4$, where $\delta(t)$ is from (7). Then the algorithm is $(L, \epsilon)$-testable.

Proof. If $u$ is an optimal arm, then $\Delta(u) = 0$, so by (7) we have $\Delta_t(u) \le \delta(t) \le \epsilon/4$. If $\Delta(u) \ge \epsilon$ then by (7) we have $\Delta_t(u) \ge \Delta(u)/2 \ge \epsilon/2$. □

7 Upper and Lower Bounds 

Plugging the classifier from Section 5 and the bandit algorithm from Section 6 into the meta-algorithm from Section 4 we obtain the following numerical guarantee.

Theorem 4. Consider an instance $S$ of the eventful bandit problem with number of rounds $T$, $n$ arms, $k$ events, minimum shift $\epsilon_S$, minimum suboptimality $\Delta$, and concept class diameter $d_{\mathcal{F}}$. Assume that any two events are at least $2L$ rounds apart, where $L = \Theta(\tfrac{n}{\epsilon_S^2} \log T)$. Consider the BWC algorithm with parameter $L$ and components classifier and bandit as presented, respectively, in Section 5 and Section 6. Then the regret of BWC is

$$R(T) \le \left( (3k + 2 d_{\mathcal{F}}) \frac{n}{\Delta} + k \frac{n}{\epsilon_S^2} \right) (\log T).$$

While the linear dependence on n in this bound may seem large, note that without additional as- 
sumptions, regret must be linear in n, since each arm must be pulled at least once. In an actual 
search engine application, the arms can be restricted to, say, the top ten results that match the query. 

We now state two lower bounds about eventful bandit problems. Theorem 5 shows that in order to achieve regret that is logarithmic in the number of rounds, a context-aware algorithm is necessary, assuming there is at least one event. Incidentally, this lower bound can be easily extended to prove that, in our model, no algorithm can achieve logarithmic regret when an event oracle $f$ is not contained in the concept class $\mathcal{F}$.

Theorem 5. Consider the eventful bandit problem with number of rounds $T$, two arms, minimum shift $\epsilon_S$ and minimum suboptimality $\Delta$, where $\epsilon_S = \Delta = \epsilon$, for an arbitrary $\epsilon \in (0, \tfrac{1}{2})$. For any context-ignoring bandit algorithm $A$, there exists a problem instance with a single event such that regret $R_A(T) \ge \Omega(\epsilon \sqrt{T})$.



Proof. For simplicity, assume that $N = \sqrt{T}$ is an integer. Define problem instances $\mathcal{I}_i$, $0 \le i \le N$, as follows. In each of these instances, the $T$ rounds are partitioned into $N$ phases, each of length $N$. There are two arms, call them $y$ and $z$. Set $p_t(y) = \tfrac{1}{2}$ for all $t$. For the problem instance $\mathcal{I}_0$, $p_t(z) = \tfrac{1}{2} - \epsilon$ for all $t$. For problem instances $\mathcal{I}_i$, $i \ge 1$, set $p_t(z) = \tfrac{1}{2} - \epsilon$ in all phases $j < i$, and $p_t(z) = \tfrac{1}{2} + \epsilon$ in all phases $j \ge i$. (Thus, in each instance $\mathcal{I}_i$ there is a single event that occurs in the first round of phase $i$.)

Now, let $q_i$ be the probability that on problem instance $\mathcal{I}_0$, arm $z$ is chosen by algorithm $A$ at least once during phase $i$. If $q_i \ge \tfrac{1}{2}$ for each phase $i$, then on the problem instance $\mathcal{I}_0$ each phase $i$ contributes at least $\epsilon/2$ to regret, so the total regret is at least $\epsilon N/2$. Otherwise, $q_i < \tfrac{1}{2}$ for some $i$. Since instances $\mathcal{I}_0$ and $\mathcal{I}_i$ coincide on the first $i - 1$ phases, algorithm $A$ behaves the same way on both instances up to the end of phase $i - 1$. Moreover, $A$ behaves the same way on both instances throughout phase $i$ assuming that it never plays arm $z$ during that phase. Therefore with probability $1 - q_i$ its regret on instance $\mathcal{I}_i$ due to phase $i$ alone is $\epsilon$ per each round in this phase; so the total regret is at least $\epsilon N/2$. □
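
The family of instances used in this proof can be written down explicitly. The sketch below is only an illustration: the function name `payoff_schedule` is hypothetical, rounds are indexed from 0, and it assumes arm $z$ pays $\tfrac12 - \epsilon$ before the event and $\tfrac12 + \epsilon$ from the first round of phase $i$ onward, with $i = 0$ meaning that no event occurs.

```python
import math
from typing import List


def payoff_schedule(i: int, horizon: int, eps: float) -> List[float]:
    """Expected payoff of arm z in every round of instance I_i (i = 0: no event)."""
    n_phases = math.isqrt(horizon)
    assert n_phases * n_phases == horizon, "sketch assumes sqrt(T) is an integer"
    schedule = []
    for t in range(horizon):
        phase = t // n_phases + 1          # N phases of length N, numbered 1..N
        shifted = i >= 1 and phase >= i    # the event has already happened
        schedule.append(0.5 + eps if shifted else 0.5 - eps)
    return schedule


T, EPS = 16, 0.1
sched0 = payoff_schedule(0, T, EPS)   # instance I_0: z is always worse than y
sched3 = payoff_schedule(3, T, EPS)   # instance I_3: z becomes better in phase 3

# Rounds at which the expected payoff of z changes (there should be exactly one).
events3 = [t for t in range(1, T) if sched3[t] != sched3[t - 1]]
# Rounds in which an algorithm that never plays z loses eps per round on I_3.
rounds_lost = sum(p > 0.5 for p in sched3)
```

An algorithm that ignores contexts cannot tell $\mathcal{I}_0$ from $\mathcal{I}_3$ until it plays $z$ in phase 3 or later, which is the trade-off the proof exploits.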

Theorem 6 proves that in Theorem 4 linear dependence on $k + d_{\mathcal{F}}$ is essentially unavoidable. If we desire a regret bound that has logarithmic dependence on the number of rounds, then a linear dependence on $k + d_{\mathcal{F}}$ is necessary.

Theorem 6. Consider the eventful bandit problem with number of rounds $T$ and concept class diameter $d_{\mathcal{F}}$. Let $A$ be an eventful bandit algorithm.

(i) There exists a problem instance with $n$ arms, $k$ events, minimum shift $\epsilon_S$, minimum suboptimality $\Delta$, where $\epsilon_S = \Delta = \epsilon$, for arbitrary $k \ge 1$, $n \ge 3$, and $\epsilon \in (0, \tfrac{1}{2})$, such that $R_A(T) \ge \Omega(k \tfrac{n}{\epsilon}) \log(T/k)$.

(ii) There exists a problem instance with two arms, a single event, minimum shift $\Theta(1)$ and minimum suboptimality $\Theta(1)$ such that regret $R_A(T) \ge \Omega(T^{1/3})$ or $R_A(T) \ge \Omega(d_{\mathcal{F}} \log T)$.

Proof. For part (i), construct the family of problem instances as follows. In each instance, there are $k$ phases of length $T/k$ each. For each phase $i$, one arm, call it $y_i$, has payoff $p_t(y_i) = \tfrac{1}{2} + \epsilon$, and all other arms $y$ have payoff $p_t(y) = \tfrac{1}{2} - \epsilon$. We have one problem instance for each sequence $\{y_i\}$ such that $y_i \ne y_{i+1}$ for each $i$. Note that there is an event in the first round of each phase; without loss of generality let us assume that this is known to the algorithm. Then in each phase $i > 1$ the algorithm (essentially) needs to solve a fresh instance of the stochastic bandit problem on $n - 1$ arms with time horizon $T/k$ and payoffs $\tfrac{1}{2} \pm \epsilon$, which implies regret $\Omega(\tfrac{n}{\epsilon}) \log(T/k)$ [16, 2]. We omit the easy formal details.

For part (ii), partition the $T$ rounds into $N = \min(d_{\mathcal{F}}, T^{1/3})$ phases, each of length at least $T^{2/3}$. We define problem instances $\mathcal{I}_i$, $0 \le i \le N$, in a similar way as in Theorem 5. There are two arms, $y$ and $z$. Set $p_t(y) = \tfrac{1}{2}$ for all $t$. For problem instance $\mathcal{I}_0$, set $p_t(z) = \frac{1}{1 + e^{1/3}}$. In problem instance $\mathcal{I}_i$, for $i \ge 1$, set $p_t(z) = \frac{1}{1 + e^{1/3}}$ in all phases $j < i$, and $p_t(z) = \frac{e^{1/3}}{1 + e^{1/3}}$ in all phases $j \ge i$. Note that $\frac{1}{1 + e^{1/3}} < \frac{1}{2} < \frac{e^{1/3}}{1 + e^{1/3}}$.

In Appendix A we show how to define the context sequence $\{x_t\}$ in a way consistent with all our assumptions, in such a way that the contexts for problem instances $\mathcal{I}_0$ and $\mathcal{I}_i$ agree in the first $i$ phases. The idea is that for both problem instances, the first round of each phase $j \le i$ triggers a false positive; this is possible since (essentially) we are allowed $d_{\mathcal{F}}$ false positives.

The rest of the proof involves calculations similar to those in the proof of Theorem 5. First, suppose $d_{\mathcal{F}} \ge T^{1/3}$. Define $q_i$ as in the proof of Theorem 5. If $q_i \ge \tfrac{1}{2}$ for each phase $i$, then for the problem instance $\mathcal{I}_0$ we have $R_A(T) \ge \Omega(T^{1/3})$. Otherwise, let $i$ be such that $q_i < \tfrac{1}{2}$. By our construction, with probability $1 - q_i$ algorithm $A$ behaves identically on instances $\mathcal{I}_0$ and $\mathcal{I}_i$ through the first $i$ phases. Thus, on instance $\mathcal{I}_i$ in phase $i$ alone it incurs regret $\Omega(1)$ per each round of the phase, for a total of $R_A(T) \ge \Omega(T^{2/3})$.

Next, suppose $d_{\mathcal{F}} < T^{1/3}$. Let $q_{i,j}$ be the probability that for problem instance $\mathcal{I}_j$, arm $z$ is chosen by $A$ at least $\log T$ times during phase $i$. If $q_{i,0} \ge \tfrac{1}{2}$ for each phase $i$, then $R_A(T) \ge \Omega(d_{\mathcal{F}} \log T)$ on problem instance $\mathcal{I}_0$. Otherwise, let $i$ be such that $q_{i,0} < \tfrac{1}{2}$. In Appendix A we give a calculation that shows that $(1 - q_{i,i}) \ge T^{-1/3} (1 - q_{i,0})$, which implies that $R_A(T) \ge \Omega(T^{-1/3}) (T^{2/3} - \log T) \ge \Omega(T^{1/3})$ on problem instance $\mathcal{I}_i$. □

8 Experiments 

To truly demonstrate the benefits of BWC requires real-time manipulation of search results. Since we
did not have the means to deploy a system that monitors click/skip activity and correspondingly al- 
ters search results with live users, we describe a collection of experiments on synthetically generated 
data. 
Figure 2: (a) (Left) BWC's cumulative regret compared to UCB1 and ORA (UCB1 with an oracle indicating the exact locations of the intent-shifting event). (b) (Right, Top Table) Final regret (in thousands) as the fraction of intent-shifting queries varies. With more intent-shifting queries, BWC's advantage over prior approaches improves. (c) (Right, Bottom Table) Final regret (in thousands) as the number of features grows.



We begin with a head-to-head comparison of BWC versus a baseline UCB1 algorithm and show
that BWC's performance improves substantially upon UCB1. Next, we compare the performance of
these algorithms as we vary the fraction of intent-shifting queries: as the fraction increases, BWC's 
performance improves even further upon prior approaches. Finally, we compare the performance 
as we vary the number of features. While our theoretical results suggest that regret grows with the 
number of features in the context space, in our experiments, we surprisingly find that BWC is robust 
to higher dimensional feature spaces. 

Setup: We synthetically generate data as follows. We assume that there are 100 queries where the 
total number of times these queries are posed is 3M. Each query has five search results for a user 
to select from. If a query does not experience any events — i.e., it is not "intent-shifting" — then 
the optimal search result is fixed over time; otherwise the optimal search result may change. Only 
10% of the queries are intent-shifting, with at most 10 events per such query. Due to the random 
nature with which data is generated, regret is reported as an average over 10 runs. The event oracle 
is an axis-parallel rectangle anchored at the origin, where points inside the box are negative and 
points outside the box are positive. Thus, if there are two features, say query volume and query
abandonment rate, an event occurs if and only if at least one of the volume and the abandonment
rate exceeds its threshold.
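
As a concrete illustration of such an event oracle, the sketch below takes the description "points outside the box are positive" literally, so a context is positive as soon as any one feature exceeds its threshold. The function name `event_oracle`, the threshold values, and the scaling of features to $[0, 1]$ are assumptions made for illustration; the paper does not specify them.

```python
from typing import Sequence


def event_oracle(context: Sequence[float], thresholds: Sequence[float]) -> bool:
    """True iff the context lies outside the origin-anchored box [0, th_1] x ... x [0, th_d]."""
    return any(x > th for x, th in zip(context, thresholds))


# Two hypothetical features: query volume and abandonment rate, both scaled to [0, 1].
thresholds = (0.7, 0.5)
quiet = event_oracle((0.3, 0.2), thresholds)         # inside the box
volume_spike = event_oracle((0.9, 0.2), thresholds)  # one feature leaves the box
both_spike = event_oracle((0.9, 0.8), thresholds)    # both features leave the box
```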

Bandit with Classifier (BWC): Figure 2(a) shows the average cumulative regret over time of three algorithms. Our baseline comparison is UCB1, which assumes that the best search result is fixed throughout. In addition, we compare to an algorithm we call ORA, which uses the event oracle to reset UCB1 whenever an event occurs. We also compared to EXP3.S, but its performance was dramatically worse and thus we have not included it in the figure.

In the early stages of the experiment before any intent-shifting event has happened, UCB1 performs the best. BWC's safe classifier makes many mistakes in the beginning and consequently pays the price of believing that each query is experiencing an event when in fact it is not. As time progresses, BWC's classifier makes fewer mistakes, and consequently knows when to reset UCB1 more accurately. UCB1 alone ignores the context entirely and thus incurs substantially larger cumulative regret by the end.

Fraction of Intent-Shifting Queries: In the next experiment, we varied the fraction of intent-shifting queries. Figure 2(b) shows the result of setting the fraction of intent-shifting queries to 0, 1/8, 1/4, 3/8 and 1/2. If there are no intent-shifting queries, then UCB1's regret is the best. We expect this outcome since BWC's classifier, because it is safe, initially assumes that all queries are intent-shifting and thus needs time to learn that in fact no queries are intent-shifting. On the other hand, BWC's regret dominates the other approaches, especially as the fraction of intent-shifting queries grows. EXP3.S's performance is quite poor in this experiment, even when all queries are intent-shifting. The reason is that even when a query is intent-shifting, there are at most 10 intent-shifting events, i.e., each query's intent is not shifting all the time.
With more intent-shifting queries, the expectation is that regret monotonically increases. In general, this seems to be true in our experiment. There is however a decrease in regret going from 1/4 to 3/8 intent-shifting queries. We believe that this is due to the fact that each query has at most 10 intent-shifting events spread uniformly and it is possible that there were fewer events with potentially smaller shifts in intent in those runs. In other words, the standard deviation of the regret is large. Over the ten 3/8 intent-shifting runs for ORA, BWC, UCB1 and EXP3.S, the standard deviation was roughly 1K, 10K, 12K and 6K respectively.

Number of Features: Finally, we comment on the performance of our approach as the number of features grows. Our theoretical results suggest that BWC's performance should deteriorate as the number of features grows. Surprisingly, BWC's performance is consistently close to the Oracle's. In Figure 2(c), we show the cumulative regret after 3M impressions as the dimensionality of the context vector grows from 10 to 40 features. BWC's regret is consistently close to ORA's as the number of features grows. On the other hand, UCB1's regret, though competitive, is worse than BWC's, while EXP3.S's performance is poor across the board. Note that the regret of both UCB1 and EXP3.S is completely independent of the number of features. The standard deviation of the regret over the 10 runs is substantially lower than in the previous experiment. For example, over 10 features, the standard deviation was 355, 1K, 5K, 4K for ORA, BWC, UCB1 and EXP3.S, respectively.

9 Future Work 

The most immediate open question is whether we could train the classifier faster. One idea is to use 
a more efficient classifier, especially if we can relax the "safety" requirement and somehow recover 
from false negatives. Another idea is to generate labeled samples not only upon positive predictions 
but upon negative ones as well, trading off the regret from additional exploration against the benefits 
of generating extra labeled samples. Finally, it would be desirable to supplement the existing worst- 
case provable guarantees with stronger ones for settings in which the contexts are sampled from a 
"benign" distribution. 

Theoretically, the main drawback of our approach is that we assume the existence of a "perfect 
oracle" — a deterministic boolean function on contexts which correctly predicts whether a temporal 
event has occurred in the current round. It is desirable to extend our results to scenarios in which 
the contexts allow only approximate or probabilistic prediction. Even though such contexts contain 
useful signal, exploiting this signal for our purposes appears quite challenging. In particular, it 
seems to require making the "bandit plus classifier" setup resilient against (infrequent) incorrectly 
labeled samples and perhaps also against (infrequent) false negatives. It should be noted that the 
aforementioned resiliency can potentially lead to large improvements in the present oracle-based 
setting as well, as we might be able to deploy much more efficient classifiers. 

Empirically, the main question left for future work is testing the "bandit plus classifier" approach in 
a realistic setting. The challenge here is two-fold. First, one needs to select which features to use 
for contexts, and verify experimentally how informative they are in predicting the temporal events. 
Second, since gaining access to live search traffic is difficult, one would need to simulate it using 
the search logs, the difficulty being that the search logs might not have enough data points for
alternatives that have not been chosen frequently by the search engine. 
A Details for the proof of Theorem 6(ii)

Claim 5. We can define context sequences $\{x_t^0\}$ and $\{x_t^1\}, \ldots, \{x_t^N\}$ with the following properties: (1) each sequence $\{x_t^i\}$, when paired with a problem instance $\mathcal{I}_i$, defines an eventful bandit problem consistent with all our assumptions, and (2) the sequences $\{x_t^0\}$ and $\{x_t^i\}$ agree through the first $i$ phases.

Proof. Let $y_1, \ldots, y_{d_{\mathcal{F}}} \in \mathcal{X}$ be a sequence of contexts such that $y_j \notin S_{\mathcal{F}}(\{y_1, \ldots, y_{j-1}\})$ for all $j = 1, \ldots, d_{\mathcal{F}}$. We know this sequence exists by the definition of $d_{\mathcal{F}}$. Also assume there exists an "always negative" context $x^-$ such that $f(x^-) = -1$ for all $f \in \mathcal{F}$ (this assumption is not necessary, but is convenient). Let $t_j$ be the first round of phase $j$.

Define $\{x_t^0\}$ as follows: let $x_{t_j}^0 = y_j$ for each phase $1 \le j \le N$, and let $x_t^0 = x^-$ for all other rounds.

For $1 \le i \le N$, define $\{x_t^i\}$ as follows: let $x_{t_j}^i = y_j$ for each phase $1 \le j \le i$, and let $x_t^i = x^-$ for all other rounds. □

Claim 6. $(1 - q_{i,i}) \ge T^{-1/3} (1 - q_{i,0})$.
Proof. Throughout this proof, we fix phase $i$. Define a realization to be a particular sequence of outcomes of all random samples from click distributions, as well as all random choices (if any), during an execution of algorithm $A$ through the end of phase $i$. For example, if $s = s_1, \ldots, s_M$ is a realization, then $s_1$ might correspond to the click observed in the first round, $s_2, \ldots, s_5$ might correspond to random choices made by the algorithm, $s_6$ might correspond to the click observed in the second round, and so on. By the chain rule, for any $j \in \{0, i\}$:

$$\Pr_{\mathcal{I}_j}[s] = \Pr_{\mathcal{I}_j}[s_1] \, \Pr_{\mathcal{I}_j}[s_2 \mid s_1] \cdots \Pr_{\mathcal{I}_j}[s_M \mid s_1, \ldots, s_{M-1}].$$

For any realization $s$, let $\Pr_{\mathcal{I}_j}[s^-]$ be the product of terms in the above product that correspond to outcomes other than observed clicks in phase $i$. Let $S$ be the set of realizations in which arm $z$ is selected by $A$ fewer than $\log T$ times in phase $i$. Let $n_{a,c}(s)$ be the number of times in realization $s$ that arm $a \in \{y, z\}$ is selected in phase $i$ and payoff $c \in \{0, 1\}$ is observed as a result. Then

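$$(1 - q_{i,0}) = \sum_{s \in S} \Pr_{\mathcal{I}_0}[s] = \sum_{s \in S} \Pr_{\mathcal{I}_0}[s^-] \left( \tfrac{1}{2} \right)^{n_{y,1}(s)} \left( \tfrac{1}{2} \right)^{n_{y,0}(s)} \left( \frac{1}{1 + e^{1/3}} \right)^{n_{z,1}(s)} \left( \frac{e^{1/3}}{1 + e^{1/3}} \right)^{n_{z,0}(s)} \le \cdots$$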
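
The remaining steps of this calculation can be sketched as follows. This is a reconstruction under stated assumptions, not necessarily the authors' exact chain of inequalities: it assumes that during phase $i$ arm $z$ has click probability $\frac{1}{1+e^{1/3}}$ under $\mathcal{I}_0$ and $\frac{e^{1/3}}{1+e^{1/3}}$ under $\mathcal{I}_i$, and that $\Pr_{\mathcal{I}_0}[s^-] = \Pr_{\mathcal{I}_i}[s^-]$ because the two instances and their contexts agree on everything except the clicks observed in phase $i$.

```latex
\begin{align*}
\frac{\Pr_{\mathcal{I}_0}[s]}{\Pr_{\mathcal{I}_i}[s]}
  &= \frac{\bigl(\tfrac{1}{1+e^{1/3}}\bigr)^{n_{z,1}(s)}
           \bigl(\tfrac{e^{1/3}}{1+e^{1/3}}\bigr)^{n_{z,0}(s)}}
          {\bigl(\tfrac{e^{1/3}}{1+e^{1/3}}\bigr)^{n_{z,1}(s)}
           \bigl(\tfrac{1}{1+e^{1/3}}\bigr)^{n_{z,0}(s)}}
   = e^{(n_{z,0}(s) - n_{z,1}(s))/3}
   \le e^{(\log T)/3} = T^{1/3}
   \qquad \text{for all } s \in S, \\
(1 - q_{i,0})
  &= \sum_{s \in S} \Pr_{\mathcal{I}_0}[s]
   \le T^{1/3} \sum_{s \in S} \Pr_{\mathcal{I}_i}[s]
   = T^{1/3}\,(1 - q_{i,i}).
\end{align*}
```

The inequality in the first line uses $n_{z,0}(s) - n_{z,1}(s) \le n_{z,0}(s) + n_{z,1}(s) < \log T$, which holds for every realization in $S$; rearranging the second line gives the bound stated in Claim 6.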